Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only delete compcomm on icomm if it exists #715

Closed
wants to merge 1 commit into from

Conversation

angus-g
Copy link

@angus-g angus-g commented Feb 5, 2024

Within the MPI implementation, an inner communicator can optionally contain a compilation commucator. An indentation error meant that the compilation attribute was unconditionally deleted, causing incorrect MPI behaviour if the attribute had never been set in the first place. On MPICH, this doesn't seem to raise any errors, but with OpenMPI it causes errors like:

mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list

Within the MPI implementation, an inner communicator can optionally
contain a compilation commucator. An indentation error meant that the
compilation attribute was unconditionally deleted, causing incorrect
MPI behaviour if the attribute had never been set in the first
place. On MPICH, this doesn't seem to raise any errors, but with
OpenMPI it causes errors like:

    mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
Copy link
Collaborator

@JDBetteridge JDBetteridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this looks like it was a mistake. Nice catch!

I will wait for the result of firedrakeproject/firedrake#3388 before merging, just to make sure nothing breaks.

@angus-g
Copy link
Author

angus-g commented Feb 6, 2024

I think there's something else hiding out under here. All the tests pass, but several of them try to clean up the null communicator (you can also see that on the master tests, so it wasn't introduced by this PR). I don't know what causes it, but one single-threaded example I found in my Docker container is test_riesz[Averaging-2-N1curl], which gives the following in the post-test summary:

PyOP2 Finalizing
Calling gc.collect()
STATE0
PYOP2 Communicator reference counts:
| Communicator name                      | Count |
==================================================
| PYOP2_COMM_WORLD_DUP                   |     1 |
| PYOP2_COMM_WORLD_DUP_COMPILATION       |     2 |
| -2080374781_DUP                        |     1 |
| -2080374781_DUP_COMPILATION            |     2 |

Freeing PYOP2_COMM_WORLD
STATE1
PYOP2 Communicator reference counts:
| Communicator name                      | Count |
==================================================
| -2080374781_DUP                        |     1 |
| -2080374781_DUP_COMPILATION            |     2 |

Freeing PYOP2_COMM_SELF
STATE2
PYOP2 Communicator reference counts:
| Communicator name                      | Count |
==================================================
| -2080374781_DUP                        |     1 |
| -2080374781_DUP_COMPILATION            |     2 |

Freeing comms in list (length 2)
Freeing -2080374781_DUP_COMPILATION, with index 3, which has refcount 2
Freeing -2080374781_DUP, with index 2, which has refcount 1
Traceback (most recent call last):
  File "mpi4py/MPI/attrimpl.pxi", line 131, in mpi4py.MPI.__pyx_fuse_1PyMPI_attr_delete_cb
  File "mpi4py/MPI/attrimpl.pxi", line 96, in mpi4py.MPI.PyMPI_attr_delete
  File "mpi4py/MPI/attrimpl.pxi", line 48, in mpi4py.MPI.PyMPI_attr_call
  File "/home/firedrake/firedrake/src/PyOP2/pyop2/mpi.py", line 191, in delcomm_outer
    ocomm = icomm.Get_attr(outercomm_keyval)
  File "mpi4py/MPI/Comm.pyx", line 1262, in mpi4py.MPI.Comm.Get_attr
mpi4py.MPI.Exception: Invalid communicator, error stack:
MPII_Comm_get_attr(85): MPI_Comm_get_attr(MPI_COMM_NULL, keyval=0xa440000a, attribute_val=0x7ffe753cd388, flag=0x7ffe753cd384) failed
MPII_Comm_get_attr(46): Null communicator

This shows the strange -2080374781_DUP communicator has become MPI_COMM_NULL at some point before finalisation. I've also checked that this occurs outside of the pytest context.

@JDBetteridge
Copy link
Collaborator

Okay, I will try and give some context for what I believe to be going on here:

This diagram/essay helps a bit: https://github.com/OP2/PyOP2/blob/master/pyop2/mpi.py#L85-L160

When a user creates a comm(unicator) and uses it to create a Firedrake object it is duplicated by the PyOP2 layer to avoid communication clash (a similar model to what PETSc does). The user is responsible for this comm and the responsible user will clean that comm before the process terminates.

The issue that causes these warning can stem from a user not cleaning the comm and having Python clean it up for them, which interferes with PyOP2's clean up process. One option here is to ignore COMM_NULL comms at clean up time, rather than throw an ugly error. Of course this then hides where we are accidentally doing this internally.

The mysterious -2080374781_DUP, I believe is a PETSc comm. In some cases the comm is fetched straight from a PETSc "object" and the comm is promoted to an mpi4py comm. If the lifetime of that comm is not managed correctly it is not cleaned up and causes the error you see. We really are far from perfect when it comes to comm management, but we're a lot better than we were a couple of years ago! The name is a result of converting the comm to a string for the purposes of debugging, COMM_WORLD and COMM_SELF have nice names to append things to, but comms from PETSc just have pointers and for debugging purposes I want to know which of these have been duplicated.

I will investigate the specific test case you have found, there is possibly an error somewhere in the test.

@connorjward
Copy link
Collaborator

Closing as this change has already been merged separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants